Test out processing different partitions concurrently #221
Conversation
Thanks, it's great to see this being worked on! I think it's best to start by deciding on the overall architecture, based on what the desired capability should be.

It looks like the architecture you're proposing has each thread working as a fully independent consumer with no shared state between threads, and each consumer coordinating with the total set of group members using the Kafka protocol. The main benefit here is that you can run multiple consumers on a single host and avoid allocating as much extra memory as you otherwise would – is that a correct assumption?

If so, I'm not sure the benefit outweighs the added complexity. For one, I suspect that memory use will often be dominated by the workload rather than by long-lived application objects – Kafka consumers load tons of messages into memory, and the deserialization process (both at the protocol and application levels) incurs even more allocations. So I think further sharing of long-lived object allocations has limited practical benefit. If we were to focus on increasing memory sharing, I think a multi-process parent/child forking setup a la Resque or Unicorn would be more reliable. Since there's no shared state anyway, and each thread does its own IO, there's really little benefit to using threads at all.

However, if we were to change focus from running entire consumers concurrently to concurrent execution within a single consumer, then I think threads make much more sense – but the complexity of the implementation increases as well. I think the JVM client does some of this, so perhaps looking into its capabilities and design would be worthwhile. But off the top of my head, I think it could make sense to allow a consumer to process messages concurrently, in the following way:
All in all, I think there’s lots of potential, but it’s not a trivial problem. If the goal is a more immediate increase in memory friendliness, then I think implementing better support for a forking mode is the way to go.
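For what it's worth, here is a minimal sketch of what such a forking mode could look like, built on ruby-kafka's documented consumer API (`Kafka.new`, `#consumer`, `#subscribe`, `#each_message`, `#stop`). It only illustrates the idea above, not any actual forking support in the library; the broker address, topic, group ID, and worker count are placeholders.

```ruby
require "kafka"

# Placeholder values; adjust for your cluster and topic.
NUM_WORKERS = 4

def run_consumer
  kafka = Kafka.new(["localhost:9092"], client_id: "my-app")
  consumer = kafka.consumer(group_id: "my-group")
  consumer.subscribe("my-topic")

  # Each child handles its own shutdown; ruby-kafka documents calling
  # Consumer#stop from a signal handler.
  trap("TERM") { consumer.stop }

  consumer.each_message do |message|
    # process message ...
  end
end

# Fork one fully independent consumer per child process: memory is shared
# copy-on-write with the parent, and no state crosses process boundaries.
pids = NUM_WORKERS.times.map { fork { run_consumer } }

# Forward TERM to the children, then wait for them to exit.
trap("TERM") { pids.each { |pid| Process.kill("TERM", pid) } }
pids.each { |pid| Process.waitpid(pid) }
```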
I'll give the forking approach a go! Your assumptions about the approach were correct: these consumers would be independent of each other.
Will close this and open a separate PR for a process-forking version.
The issue: #188
Adds a `ConcurrentRunner`, which creates a thread pool to orchestrate two or more consumers. Shutdown is signalled through an `IO.pipe`, since the usual synchronization primitives can't be used from a `trap` context. The approach is similar to the one outlined here.
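For context, a rough sketch of the self-pipe shutdown pattern referred to above. This is illustrative only, using ruby-kafka's documented consumer API rather than the PR's actual `ConcurrentRunner`; the broker, topic, group ID, and thread count are placeholders.

```ruby
require "kafka"

# The trap handler only writes a byte to a pipe (safe in a trap context);
# the main thread blocks on the pipe and then stops each consumer.
reader, writer = IO.pipe

consumers = 2.times.map do
  kafka = Kafka.new(["localhost:9092"], client_id: "my-app")
  kafka.consumer(group_id: "my-group").tap { |c| c.subscribe("my-topic") }
end

threads = consumers.map do |consumer|
  Thread.new do
    consumer.each_message do |message|
      # process message ...
    end
  end
end

# Mutexes and condition variables can't be used here, but a pipe write can.
trap("TERM") { writer.write_nonblock(".") }

reader.read(1)          # block until the trap handler fires
consumers.each(&:stop)  # ask each consumer loop to exit
threads.each(&:join)
```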
The Ruby documentation mentions that `IO.pipe` is not supported on all platforms; however, I can't find which platforms these are. Worst case, it could be replaced with the first example from the above article.

What hasn't been done yet?
Outstanding question(s):
Couple of other points:
- I ran into `Errno::EMFILE: Too many open files`. This seems to be fixed by upping the number of open file descriptors with `ulimit -n 1024`; on my machine this was set to `256` by default.
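As an aside (my addition, not something the PR does), the same limit can be inspected and raised from inside the Ruby process instead of via the shell:

```ruby
# Check the per-process open file descriptor limit from Ruby itself,
# as an alternative to running `ulimit -n` in the shell.
soft, hard = Process.getrlimit(:NOFILE)
puts "file descriptor limit: soft=#{soft}, hard=#{hard}"

# Raise the soft limit (up to the hard limit) if it's below what we need.
Process.setrlimit(:NOFILE, [1024, hard].min, hard) if soft < 1024
```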